Data cleaning

Quantitative Methodology (UPF)

Jordi Mas Elias

https://www.jordimas.cat/

Summary

  • Introduction
  • Tidy data
  • Missing data
  • Join data
  • Codebooks

Introduction

Introduction

“Data Scientists spend up to 80% of the time on data cleaning and 20% on actual data analysis”.

Source: R for Data Science

Introduction

Why do we need to clean data?

  • Data is not tidy.
  • Data is missing.
  • Data is many datasets.

Tidy data

Wickham rules

A df is tidy if it fulfills these requirements (Wickham 2014):

  • Each df has one unit of observation.
  • Observations represented in the rows.
  • Variables represented in the columns.
  • Each cell indicates a value.

Source: R for Data Science

A tidy dataframe

Correlates of War Intra-State War Data

# A tibble: 442 × 4
   WarName            WarType SideA          SideB                           
   <chr>                <dbl> <chr>          <chr>                           
 1 First Caucasus           5 Russia         Georgians, Dhagestania, Chechens
 2 Sidon-Damascus           6 Sidon          Damascus & Aleppo               
 3 First Two Sicilies       4 Austria        -8                              
 4 First Two Sicilies       4 Two Sicilies   Liberals                        
 5 Spanish Royalists        4 Spain          Royalists                       
 6 Sardinian Revolt         4 Austria        -8                              
 7 Sardinian Revolt         4 Sardinia       Carbonari                       
 8 Greek Independence       5 Ottoman Empire Greeks                          
 9 Greek Independence       5 -8             United Kingdom                  
10 Greek Independence       5 -8             France                          
# … with 432 more rows

An untidy dataframe

PEW Research Religious Landscape Study1

# A tibble: 18 × 6
   religion                `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k`
   <chr>                     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1 Agnostic                     27        34        60        81        76
 2 Atheist                      12        27        37        52        35
 3 Buddhist                     27        21        30        34        33
 4 Catholic                    418       617       732       670       638
 5 Don’t know/refused           15        14        15        11        10
 6 Evangelical Prot            575       869      1064       982       881
 7 Hindu                         1         9         7         9        11
 8 Historically Black Prot     228       244       236       238       197
 9 Jehovah's Witness            20        27        24        24        21
10 Jewish                       19        19        25        25        30
11 Mainline Prot               289       495       619       655       651
12 Mormon                       29        40        48        51        56
13 Muslim                        6         7         9        10         9
14 Orthodox                     13        17        23        32        32
15 Other Christian               9         7        11        13        13
16 Other Faiths                 20        33        40        46        49
17 Other World Religions         5         2         3         4         2
18 Unaffiliated                217       299       374       365       341

Untidy data

Number of TB cases documented by the WHO in Afghanistan, Brazil, and China between 1999 and 2000 (cases & population).

Table A

# A tibble: 6 × 3
  country      year rate             
* <chr>       <int> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Table B

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Table C

# A tibble: 12 × 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Table D

# A tibble: 3 × 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

Pivoting data

We change the rows and columns of the dataframe keeping the same information.

Pivot wider

pivot_wider(df, names_from, values_from)

Pivot longer

pivot_longer(df, cols, names_to, names_to)

Separate

separate(df, col, into, sep)

Unite

unite(df, col, ..., sep)

Exercise

Check the following dataset:

  • Tidy the data.
?world_bank_pop
world_bank_pop |> 
  pivot_longer(`2000`:`2017`, names_to = "year", values_to = "vals") |> 
  pivot_wider(names_from = indicator, values_from = vals)

Pivoting dataframes

What does it mean?

Missing data

Join data

Codebooks

Wickham, Hadley. 2014. Tidy Data.” Journal of Statistical Software 50 (10): 1–23.